{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "00-covertype-data-generation.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "toc_visible": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "sj_pbQzwpiUS",
        "colab_type": "text"
      },
      "source": [
        "# Generating Skewed Data for Prediction\n",
        "\n",
        "This notebook generates skewed data based on the [covertype](https://archive.ics.uci.edu/ml/datasets/covertype) dataset from the UCI Machine Learning Repository. The generated data is then used to simulate an online prediction request workload against a model version deployed on AI Platform Prediction.\n",
        "\n",
        "The notebook covers the following steps:\n",
        "1. Download the data\n",
        "2. Define dataset metadata\n",
        "3. Sample unskewed data points\n",
        "4. Prepare skewed data points\n",
        "5. Simulate serving workload to AI Platform Prediction\n",
        "\n"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NDOuIyjJpxQa",
        "colab_type": "text"
      },
      "source": [
        "## Setup"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "022O03X5mNQk",
        "colab_type": "text"
      },
      "source": [
        "### Install packages and dependencies"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "0sGt6PC9Dbti",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "!pip install -U -q google-api-python-client\n",
        "!pip install -U -q pandas"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "eSxqd3WamCuo",
        "colab_type": "text"
      },
      "source": [
        "### Set up your GCP project"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "foxVB6rpmX_p",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "PROJECT_ID = 'sa-data-validation'\n",
        "BUCKET =  'sa-data-validation'\n",
        "REGION = 'us-central1'\n",
        "!gcloud config set project $PROJECT_ID"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2k8aDgXCmVdD",
        "colab_type": "text"
      },
      "source": [
        "### Authenticate your GCP account\n",
        "\n",
        "This step is required only if you run the notebook in Colab."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "hqE7KOXBDSMj",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "try:\n",
        "  from google.colab import auth\n",
        "  auth.authenticate_user()\n",
        "  print(\"Colab user is authenticated.\")\n",
        "except ImportError:\n",
        "  # google.colab is unavailable outside Colab; skip this step.\n",
        "  pass"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "TQBn3G_lmkvW",
        "colab_type": "text"
      },
      "source": [
        "### Import libraries"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "1Kgv1ijYI3bV",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import os\n",
        "from tensorflow import io as tf_io\n",
        "import matplotlib.pyplot as plt\n",
        "import pandas as pd\n",
        "import numpy as np"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "I2NMCH8AmpMm",
        "colab_type": "text"
      },
      "source": [
        "### Define constants\n",
        "\n",
        "You can change the default values of the following constants."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "w3SMI1H_DknU",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "LOCAL_WORKSPACE = './workspace'\n",
        "LOCAL_DATA_DIR = os.path.join(LOCAL_WORKSPACE, 'data')\n",
        "LOCAL_DATA_FILE = os.path.join(LOCAL_DATA_DIR, 'train.csv')\n",
        "BQ_DATASET_NAME = 'data_validation'\n",
        "BQ_TABLE_NAME = 'covertype_classifier_logs'\n",
        "MODEL_NAME = 'covertype_classifier'\n",
        "VERSION_NAME = 'v1'\n",
        "MODEL_OUTPUT_KEY = 'probabilities'\n",
        "SIGNATURE_NAME = 'serving_default'"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8tH1C0Yjp1Rw",
        "colab_type": "text"
      },
      "source": [
        "## 1. Download Data\n",
        "\n",
        "The covertype dataset is preprocessed, split, and uploaded to the `gs://workshop-datasets/covertype` public GCS location.\n",
        "\n",
        "We use this version of the preprocessed dataset in this notebook. For more information, see [Cover Type Dataset](https://github.com/GoogleCloudPlatform/mlops-on-gcp/tree/master/datasets/covertype)."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "mPyEIzdTD1p0",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "if tf_io.gfile.exists(LOCAL_WORKSPACE):\n",
        "  print(\"Removing previous workspace artifacts...\")\n",
        "  tf_io.gfile.rmtree(LOCAL_WORKSPACE)\n",
        "\n",
        "print(\"Creating a new workspace...\")\n",
        "tf_io.gfile.makedirs(LOCAL_WORKSPACE)\n",
        "tf_io.gfile.makedirs(LOCAL_DATA_DIR)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "YDVDp0STEKB9",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "!gsutil cp gs://workshop-datasets/covertype/data_validation/training/dataset.csv {LOCAL_DATA_FILE}\n",
        "!wc -l {LOCAL_DATA_FILE}"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "zbDx6WekEKZP",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "data = pd.read_csv(LOCAL_DATA_FILE)\n",
        "print(\"Total number of records: {}\".format(len(data.index)))\n",
        "data.sample(10).T"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PNPxgqVCqE7J",
        "colab_type": "text"
      },
      "source": [
        "## 2. Define Metadata"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "rvRdZIuoETkU",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "HEADER = ['Elevation', 'Aspect', 'Slope','Horizontal_Distance_To_Hydrology',\n",
        "          'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',\n",
        "          'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm',\n",
        "          'Horizontal_Distance_To_Fire_Points', 'Wilderness_Area', 'Soil_Type',\n",
        "          'Cover_Type']\n",
        "\n",
        "TARGET_FEATURE_NAME = 'Cover_Type'\n",
        "\n",
        "FEATURE_LABELS = ['0', '1', '2', '3', '4', '5', '6']\n",
        "\n",
        "NUMERIC_FEATURE_NAMES = ['Aspect', 'Elevation', 'Hillshade_3pm', \n",
        "                         'Hillshade_9am', 'Hillshade_Noon', \n",
        "                         'Horizontal_Distance_To_Fire_Points',\n",
        "                         'Horizontal_Distance_To_Hydrology',\n",
        "                         'Horizontal_Distance_To_Roadways','Slope',\n",
        "                         'Vertical_Distance_To_Hydrology']\n",
        "\n",
        "CATEGORICAL_FEATURE_NAMES = ['Soil_Type', 'Wilderness_Area']\n",
        "\n",
        "FEATURE_NAMES = CATEGORICAL_FEATURE_NAMES + NUMERIC_FEATURE_NAMES\n",
        "\n",
        "HEADER_DEFAULTS = [[0] if feature_name in NUMERIC_FEATURE_NAMES + [TARGET_FEATURE_NAME] else ['NA'] \n",
        "                   for feature_name in HEADER]\n",
        "\n",
        "NUM_CLASSES = len(FEATURE_LABELS)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "BAriSB__6Jsr",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "for feature_name in CATEGORICAL_FEATURE_NAMES:\n",
        "  data[feature_name] = data[feature_name].astype(str)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "gejOSnS_E_lo",
        "colab_type": "text"
      },
      "source": [
        "## 3. Sample Normal Data\n",
        "\n",
        "Draw a random sample of 2,000 records to represent the normal (unskewed) serving data."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "kNMF9rkpE_tz",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "normal_data = data.sample(2000)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "WYlqANUdGw0I",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(20, 10))\n",
        "normal_data['Elevation'].plot.hist(bins=15, ax=axes[0][0], title='Elevation')\n",
        "normal_data['Aspect'].plot.hist(bins=15, ax=axes[0][1], title='Aspect')\n",
        "normal_data['Wilderness_Area'].value_counts(normalize=True).plot.bar(ax=axes[1][0], title='Wilderness Area')\n",
        "normal_data[TARGET_FEATURE_NAME].value_counts(normalize=True).plot.bar(ax=axes[1][1], title=TARGET_FEATURE_NAME)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Lrp50MOhFK2y",
        "colab_type": "text"
      },
      "source": [
        "## 4. Prepare Skewed Data\n",
        "We introduce the following skews to a random sample of 1,000 data points:\n",
        "1. **Numerical Features**\n",
        " * *Elevation - Feature Skew*: Convert the unit of measure from meters to kilometers for 10% of the data points\n",
        " * *Aspect - Distribution Skew*: Decrease each value by a random percentage between 1% and 50%\n",
        "2. **Categorical Features**\n",
        " * *Wilderness_Area - Feature Skew*: Add a new category \"Others\" to roughly 10% of the data points\n",
        " * *Wilderness_Area - Distribution Skew*: Increase the frequency of \"Neota\" by reassigning roughly 25% of the \"Rawah\" and \"Commanche\" values to it"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "aabdFtM-FLCT",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "skewed_data = data.sample(1000)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_nalaB83U75N",
        "colab_type": "text"
      },
      "source": [
        "### 4.1 Skew numerical features"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "YdFqotRGWtXp",
        "colab_type": "text"
      },
      "source": [
        "#### 4.1.1 Elevation Feature Skew"
      ]
    },
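    {
      "cell_type": "code",
      "metadata": {
        "id": "ROUARI9MU3w4",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "# Fraction of rows whose Elevation is converted from meters to kilometers.\n",
        "ratio = 0.1\n",
        "size = int(len(skewed_data.index) * ratio)\n",
        "indexes = np.random.choice(skewed_data.index, size=size, replace=False)\n",
        "# Use .loc to avoid chained assignment, which may silently modify a copy.\n",
        "skewed_data.loc[indexes, 'Elevation'] = skewed_data.loc[indexes, 'Elevation'] // 1000"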
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "LdiF7ONLVuM_",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))\n",
        "normal_data['Elevation'].plot.hist(bins=15, ax=axes[0], title='Elevation - Normal')\n",
        "skewed_data['Elevation'].plot.hist(bins=15,  ax=axes[1], title='Elevation - Skewed')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "POXFPHdtWzSw",
        "colab_type": "text"
      },
      "source": [
        "#### 4.1.2 Aspect Distribution Skew"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "euB8qRnzWzaP",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "skewed_data['Aspect'] = skewed_data['Aspect'].apply(\n",
        "    lambda value: int(value * np.random.uniform(0.5, 0.99))\n",
        ")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "1Qga7opvY5Ir",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))\n",
        "normal_data['Aspect'].plot.hist(bins=15, ax=axes[0], title='Aspect - Normal')\n",
        "skewed_data['Aspect'].plot.hist(bins=15,  ax=axes[1], title='Aspect - Skewed')"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "RDwWYPTnfTZ1",
        "colab_type": "text"
      },
      "source": [
        "### 4.2 Skew categorical features"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "MuGay843fcQY",
        "colab_type": "text"
      },
      "source": [
        "#### 4.2.1 Wilderness Area Feature Skew\n",
        "Replace roughly 10% of the values with a new category, \"Others\"."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "sIXorDZff4yS",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "skewed_data['Wilderness_Area'] = skewed_data['Wilderness_Area'].apply(\n",
        "    lambda value: 'Others' if np.random.uniform() <= 0.1 else value\n",
        ")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zz8hhWluhSp9",
        "colab_type": "text"
      },
      "source": [
        "#### 4.2.2 Wilderness Area Distribution Skew"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "aISRDXvrhSzR",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "skewed_data['Wilderness_Area'] = skewed_data['Wilderness_Area'].apply(\n",
        "    lambda value: 'Neota' if value in ['Rawah', 'Commanche'] and np.random.uniform() <= 0.25 else value\n",
        ")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "VjnFXyERi941",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 5))\n",
        "normal_data['Wilderness_Area'].value_counts(normalize=True).plot.bar(ax=axes[0], title='Wilderness Area - Normal')\n",
        "skewed_data['Wilderness_Area'].value_counts(normalize=True).plot.bar(ax=axes[1], title='Wilderness Area - Skewed')\n"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Y0kQHXkdFGZJ",
        "colab_type": "text"
      },
      "source": [
        "## 5. Simulate Serving Workload"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "2TCilipgUKKf",
        "colab_type": "text"
      },
      "source": [
        "### 5.1 Implement the model API client"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "yEyoTL5wFlu9",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import googleapiclient.discovery\n",
        "import numpy as np\n",
        "\n",
        "service = googleapiclient.discovery.build('ml', 'v1')\n",
        "name = 'projects/{}/models/{}/versions/{}'.format(PROJECT_ID, MODEL_NAME, VERSION_NAME)\n",
        "print(\"Service name: {}\".format(name))\n",
        "\n",
        "def caip_predict(instance):\n",
        "  \n",
        "  request_body={\n",
        "      'signature_name': SIGNATURE_NAME,\n",
        "      'instances': [instance]\n",
        "      }\n",
        "\n",
        "  response = service.projects().predict(\n",
        "      name=name,\n",
        "      body=request_body\n",
        "\n",
        "  ).execute()\n",
        "\n",
        "  if 'error' in response:\n",
        "    raise RuntimeError(response['error'])\n",
        "\n",
        "  probability_list = [output[MODEL_OUTPUT_KEY] for output in response['predictions']]\n",
        "  classes = [FEATURE_LABELS[int(np.argmax(probabilities))] for probabilities in probability_list]\n",
        "  return classes"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "4t0vQlldRkDM",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import time\n",
        "\n",
        "def simulate_requests(data_frame):\n",
        "\n",
        "  print(\"Simulation started...\")\n",
        "  print(\"---------------------\")\n",
        "  print(\"Number of instances: {}\".format(len(data_frame.index)))\n",
        "\n",
        "  i = 0\n",
        "  for _, row in data_frame.iterrows():\n",
        "    instance = dict(row)\n",
        "    instance.pop(TARGET_FEATURE_NAME)\n",
        "    for k,v in instance.items():\n",
        "      instance[k] = [v]\n",
        "\n",
        "    predicted_class = caip_predict(instance)\n",
        "    i += 1\n",
        "\n",
        "    print(\".\", end='')\n",
        "\n",
        "    # i already counts the request just sent, so report it directly.\n",
        "    if i % 100 == 0:\n",
        "      print()\n",
        "      print(\"Sent {} requests.\".format(i))\n",
        "\n",
        "    time.sleep(0.5)\n",
        "  print(\"\")\n",
        "  print(\"-------------------\")\n",
        "  print(\"Simulation finished.\")"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-wOg4Crppnrl",
        "colab_type": "text"
      },
      "source": [
        "### 5.2 Simulate AI Platform Prediction requests"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "8w_CtO5xTwjS",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "simulate_requests(normal_data)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "SaH2HYFskmu4",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "simulate_requests(skewed_data)"
      ],
      "execution_count": 0,
      "outputs": []
    }
  ]
}